 causal evidence


Interpretability as Alignment: Making Internal Understanding a Design Principle

Sengupta, Aadit, Seth, Pratinav, Sankarapu, Vinay Kumar

arXiv.org Artificial Intelligence

Frontier AI systems require governance mechanisms that can verify internal alignment, not just behavioral compliance. Private governance mechanisms (audits, certification, insurance, and procurement) are emerging to complement public regulation, but they require technical substrates that generate verifiable causal evidence about model behavior. This paper argues that mechanistic interpretability provides this substrate. We frame interpretability not as post-hoc explanation but as a design constraint, embedding auditability, provenance, and bounded transparency within model architectures. Integrating causal abstraction theory and empirical benchmarks such as MIB and LoBOX, we outline how interpretability-first models can underpin private assurance pipelines and role-calibrated transparency frameworks. This reframing situates interpretability as infrastructure for private AI governance, bridging the gap between technical reliability and institutional accountability.


Causal Evidence for the Primordiality of Colors in Trans-Neptunian Objects

Davis, Benjamin L., Ali-Dib, Mohamad, Zheng, Yujia, Jin, Zehao, Zhang, Kun, Macciò, Andrea Valerio

arXiv.org Artificial Intelligence

The origins of the colors of Trans-Neptunian Objects (TNOs) represent a crucial unresolved question, central to understanding the history of our Solar System. Recent observational surveys have revealed correlations between the eccentricity and inclination of TNOs and their colors. This has rekindled the long-standing debate on whether these colors reflect the conditions of TNO formation or their subsequent collisional evolution. In this study, we address this question with 98.7% certainty, using a model-agnostic, data-driven approach based on causal graphs. First, as a sanity check, we demonstrate how our model can replicate the currently accepted paradigms of TNOs' dynamical history, blindly and without any orbital modeling or physics-based assumptions. In fact, our causal model (with no knowledge of the existence of Neptune) predicts the existence of an unknown perturbing body, i.e., Neptune. We then show how this model predicts, with high certainty, that the color of TNOs is the root cause of their inclination distribution, rather than the other way around. This strongly suggests that the colors of TNOs reflect an underlying dynamical property, most likely their formation location. Moreover, our causal model excludes formation scenarios that invoke substantial color modification by subsequent irradiation. We therefore conclude that the colors of TNOs are predominantly primordial.


How Do Transformers "Do" Physics? Investigating the Simple Harmonic Oscillator

Kantamneni, Subhash, Liu, Ziming, Tegmark, Max

arXiv.org Artificial Intelligence

How do transformers model physics? Do transformers model systems with interpretable analytical solutions, or do they create "alien physics" that are difficult for humans to decipher? We take a step in demystifying this larger puzzle by investigating the simple harmonic oscillator (SHO), $\ddot{x}+2\gamma \dot{x}+\omega_0^2x=0$, one of the most fundamental systems in physics. Our goal is to identify the methods transformers use to model the SHO, and to do so we hypothesize and evaluate possible methods by analyzing the encoding of these methods' intermediates. We develop four criteria for the use of a method within the simple testbed of linear regression, where our method is $y = wx$ and our intermediate is $w$: (1) Can the intermediate be predicted from hidden states? (2) Is the intermediate's encoding quality correlated with model performance? (3) Can the majority of variance in hidden states be explained by the intermediate? (4) Can we intervene on hidden states to produce predictable outcomes? Armed with these two correlational (1,2), weak causal (3) and strong causal (4) criteria, we determine that transformers use known numerical methods to model trajectories of the simple harmonic oscillator, specifically the matrix exponential method. Our analysis framework can conveniently extend to high-dimensional linear systems and nonlinear systems, which we hope will help reveal the "world model" hidden in transformers.
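For readers unfamiliar with the matrix exponential method named above, here is a minimal sketch of how it advances the oscillator $\ddot{x}+2\gamma \dot{x}+\omega_0^2x=0$ in time. It is the standard textbook construction, not code from the paper, and it makes no claim about how a transformer implements the method internally. The function names (`expm_taylor`, `sho_step_matrix`, `simulate`), the truncated Taylor series used to compute the exponential, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np


def expm_taylor(M, terms=30):
    """Matrix exponential via a truncated Taylor series.

    Adequate only when the entries of M are small (as they are here for
    small time steps); a production code would use a library routine.
    """
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k          # term now holds M^k / k!
        result = result + term
    return result


def sho_step_matrix(gamma, omega0, dt):
    """One-step propagator exp(A * dt) for the state (x, v).

    Writing v = dx/dt turns the second-order equation into
    d/dt (x, v) = A (x, v) with A = [[0, 1], [-omega0^2, -2*gamma]].
    """
    A = np.array([[0.0, 1.0], [-omega0**2, -2.0 * gamma]])
    return expm_taylor(A * dt)


def simulate(x0, v0, gamma, omega0, dt, n_steps):
    """Return an array of shape (n_steps + 1, 2) holding (x, v) at each step."""
    step = sho_step_matrix(gamma, omega0, dt)
    state = np.array([x0, v0], dtype=float)
    trajectory = [state]
    for _ in range(n_steps):
        state = step @ state         # exact update for a linear system
        trajectory.append(state)
    return np.array(trajectory)


if __name__ == "__main__":
    # Illustrative values only: an underdamped oscillator released from rest.
    traj = simulate(x0=1.0, v0=0.0, gamma=0.5, omega0=2.0, dt=0.1, n_steps=50)
    print(traj[:5])
```

Because the system is linear, the same 2x2 matrix is reused at every step, and the update is exact up to the accuracy of the exponential itself. That is what distinguishes this method from finite-difference schemes such as Euler integration.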


Testosterone Can Make Men Feel Generous - Facts So Romantic

Nautilus

Testosterone gets a pretty bad reputation. It has long been known as the hormone of aggression. In his 1998 book, The Trouble With Testosterone: And Other Essays on the Biology of the Human Predicament, the neuroscientist Robert Sapolsky writes, "What evidence links testosterone with aggression? Some pretty obvious stuff": Males tend to have more testosterone than women, and tend to be more aggressive. "Times of life when males are swimming in testosterone (for example, after reaching puberty) correspond to when aggression peaks."